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ABSTRACT 

Oral  communication  is  ubiquitous  and  carries  important  in¬ 
formation  yet  it  is  also  time  consuming  to  document.  Given 
the  development  of  storage  media  and  networks  one  could 
just  record  and  store  a  conversation  for  documentation.  The 
question  is,  however,  how  an  interesting  information  piece 
would  be  found  in  a  large  database.  Traditional  informa¬ 
tion  retrieval  techniques  use  a  histogram  of  keywords  as  the 
document  representation  but  oral  communication  may  offer 
additional  indices  such  as  the  time  and  place  of  the  rejoinder 
and  the  attendance.  An  alternative  index  could  be  the  ac¬ 
tivity  such  as  discussing,  planning,  informing,  story-telling, 
etc.  This  paper  addresses  the  problem  of  the  automatic  de¬ 
tection  of  those  activities  in  meeting  situation  and  everyday 
rejoinders.  Several  extensions  of  this  basic  idea  are  being 
discussed  and/or  evaluated:  Similar  to  activities  one  can 
define  subsets  of  larger  database  and  detect  those  automati¬ 
cally  which  is  shown  on  a  large  database  of  TV  shows.  Emo¬ 
tions  and  other  indices  such  as  the  dominance  distribution 
of  speakers  might  be  available  on  the  surface  and  could  be 
used  directly.  Despite  the  small  size  of  the  databases  used 
some  results  about  the  effectiveness  of  these  indices  can  be 
obtained. 
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1.  INTRODUCTION 

Information  access  to  oral  communication  is  becoming  an 
interesting  research  area  since  recording,  storing  and  trans¬ 
mitting  large  amounts  of  audio  (and  video)  data  is  feasible 
today.  While  written  information  is  often  available  elec¬ 
tronically  (especially  since  it  is  typically  entered  on  com¬ 
puters)  oral  communication  is  usually  only  documented  by 
constructing  a  new  document  in  written  form  such  as  a  tran¬ 
script  (court  proceedings)  or  minutes  (meetings).  Oral  com¬ 
munications  are  therefore  a  large  untapped  resource,  espe¬ 
cially  if  no  corresponding  written  documents  are  available 
and  the  cost  of  documentation  using  traditional  techniques 
is  considered  high:  Tutorial  introductions  by  a  senior  staff 
member  might  be  worthwhile  to  attend  by  many  newcomers, 
office  meetings  may  contain  informations  relevant  for  oth¬ 
ers  and  should  be  reproducable,  informal  and  formal  group 
meetings  may  be  interesting  but  not  fully  documented.  In 
essence  the  written  form  is  already  a  reinterpretation  of  the 
original  rejoinder.  Such  a  reinterpretation  are  used  to 

•  extract  and  condense  information 

•  add  or  delete  information 

•  change  the  meaning 

•  cite  the  rejoinder 

•  relate  rejoinders  to  each  other 

Reinterpretation  is  a  time  consuming,  expensive  and  op¬ 
tional  step  and  written  documentation  is  combining  reinter¬ 
pretation  and  documentation  step  in  one  ^ .  If  however  rein¬ 
terpretation  is  not  necessary  or  unwanted  a  system  which 
is  producing  audiovisual  records  is  superior.  If  reinterpreta¬ 
tion  is  wanted  or  needed  a  system  using  audiovisual  records 
may  be  used  to  improve  the  reinterpretation  by  adding  all 
audiovisual  data  and  the  option  to  go  back  to  the  unaltered 
original.  Whether  reinterpretation  is  done  or  not  it  is  cru¬ 
cial  to  be  able  to  navigate  effectively  within  an  audiovisual 
document  and  to  find  a  specific  document. 

^The  most  important  exception  is  the  literal  courtroom 
transcript,  however  one  could  argue  that  even  transcripts 
are  reinterpretations  since  they  do  not  contain  a  number  of 
informations  present  in  the  audio  channel  such  as  emotions, 
hesitations,  the  use  of  slang  and  certain  types  of  hetereglos- 
sia,  accents  and  so  forth.  This  is  specihcally  true  if  tran¬ 
scription  machines  are  used  which  restrict  the  transcriber 
to  standard  orthography. 
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Figure  1:  Information  access  hierarchy:  Oral  com¬ 
munications  take  place  in  very  different  formats  and 
the  first  step  in  the  search  is  to  determine  the 
database  (or  sub-database)  of  the  rejoinder.  The 
next  step  is  to  find  the  specific  rejoinder.  Since  re¬ 
joinders  can  be  very  long  the  rejoinder  has  to  seg¬ 
mented  and  a  segment  has  to  be  selected. 


While  keywords  are  commonly  used  in  information  ac¬ 
cess  to  written  information  the  use  of  other  indices  such  as 
style  is  still  uncommon  (but  see  Kessler  et  al.  (1997);  van 
Bretan  et  al.  (1998)).  Oral  communication  is  richer  than 
written  communication  since  it  is  an  interactive  real  time 
accomplishment  between  participants,  may  involve  speech 
gestures  such  as  the  display  of  emotion  and  is  situated  in 
space  and  time.  Bahktin  (1986)  characterizes  a  conversa¬ 
tion  by  topic,  situation  and  style.  Information  access  to 
oral  communication  can  therefore  make  use  of  indices  that 
pertain  to  the  oral  nature  of  the  discourse  (Fig.  2).  In¬ 
dices  other  than  topic  (represented  by  keywords)  increase 
in  importance  since  browsing  audio  documents  is  cumber¬ 
some  which  makes  the  common  interactive  retrieval  strategy 
“query,  browse,  reformulate”  less  effective.  Finally  the  topic 
may  not  be  known  at  all  or  may  not  be  that  relevant  for  the 
query  formulation,  for  example  if  one  just  wants  to  be  re¬ 
minded  what  was  being  discussed  last  time  a  person  was 
met.  Activities  are  suggested  as  an  alternative  index  and 
are  a  description  of  the  type  of  interaction.  It  is  common 
to  use  “action-verbs”  such  as  story-telling,  discussing,  plan¬ 
ning,  informing,  etc.  to  describe  activities  Items  similar 
to  activities  have  been  shown  to  be  directly  retrievable  from 
autobiographic  memory  (Herrmann,  1993)  and  are  therefore 
indices  that  are  available  to  participants  of  the  conversation. 
Other  indices  may  be  very  effective  but  not  available:  The 
frequency  of  the  word  “I”  in  the  conversation,  the  histogram 
of  word  lengths  or  the  histogram  of  pitch  per  participant. 

In  Fig.  1  the  information  access  hierarchy  is  being  intro¬ 
duced  which  allows  to  understand  the  problem  of  informa¬ 
tion  access  to  oral  communication  at  different  levels.  In 
Ries  (1999)  we  have  shown  that  the  detection  of  general  di- 


^  The  definition  of  activities  such  as  planning  may  vary 
vastly  across  general  dialogue  genres,  for  example  compare 
a  military  combat  situation  with  a  mother  child  interaction. 
However  it  is  often  possible  to  develop  activities  and  dia¬ 
logue  typologies  for  a  specific  dialogue  genre.  The  related 
problem  of  general  typologies  of  dialogues  is  still  far  from 
being  settled  and  action-verbs  are  just  one  potential  catego¬ 
rization  (Fritz  and  Hundschnur,  1994). 


Style  Situation 

Figure  2:  Bahktin’s  characterization  of  dialogue: 
Bahktin  (1986)  describes  a  discourse  along  the  three 
major  properties  style,  situation  and  topic.  Current 
information  retrieval  systems  focus  on  the  topical  as¬ 
pect  which  might  be  crucial  in  written  documents. 
Furthermore,  since  throughout  text  analysis  is  still  a 
hard  problem,  information  retrieval  has  mostly  used 
keywords  to  characterize  topic.  Many  features  that 
could  be  extracted  are  therefore  ignored  in  a  tradi¬ 
tional  keyword  based  approach. 


alogue  genre  (database  level  in  Fig.  1)  can  be  done  with 
high  accuracy  if  a  number  of  different  example  types  have 
been  annotated;  in  Ries  et  al.  (2000)  we  have  shown  that  it 
is  hard  but  not  impossible  to  distinguish  activities  in  per¬ 
sonal  phone  calls  (segment  level  in  Fig.  1)  .  In  this  paper 
we  will  address  activities  in  meetings  and  other  types  of  di¬ 
alogues  and  show  that  these  activities  can  be  distinguished 
using  certain  features  and  a  neural  network  based  classifier 
(Sec.  2,  segment  level  in  Fig.  1).  The  concept  of  information 
retrieval  assessment  using  information  theoretic  measures  is 
applied  to  this  task  (Sec.  3).  Additionally  we  will  introduce 
a  level  somewhat  below  the  database  level  in  Fig.  1  that  we 
call  “sub-genre”  and  we  have  collected  a  large  database  of 
TV-shows  that  are  automatically  classified  for  their  show- 
type  (Sec.  4).  We  also  explore  whether  there  are  other  in¬ 
dices  similar  to  activities  that  could  be  used  and  we  are 
presenting  results  on  emotions  in  meetings  (Sec.  5). 

2.  ACTIVITY  DETECTION 

We  are  interested  in  the  detection  of  activities  that  are 
described  by  action  verbs  and  have  annotated  those  in  two 
databases: 

meetings  have  been  collected  at  Interactive  Systems  Labs  at 
CMU  (Waibel  et  ah,  1998)  and  a  subset  of  8  meetings 
has  been  annotated.  Most  of  the  meetings  are  by  the 
data  annotation  group  itself  and  are  fairly  informal  in 
style.  The  participants  are  often  well  acquainted  and 
meet  each  other  a  lot  besides  their  meetings. 

Santa  Barbara  (SBC)  is  a  corpus  released  by  the  LDC 
and  7  out  of  12  rejoinders  have  been  annotated. 

The  annotator  has  been  instructed  to  segment  the  rejoin¬ 
ders  into  units  that  are  coherent  with  respect  to  their  topic 


Activity 

SBC 

Meeting 

Discussion 

35 

58 

Information 

25 

23 

Story-telling 

24 

10 

Planning 

7 

19 

Undetermined 

5 

8 

Advising 

5 

17 

Not  meeting 

3 

2 

Interrogation 

2 

1 

Evaluation 

1 

0 

Introduction 

0 

1 

Closing 

0 

1 

Table  1:  Distribution  of  activity  types:  Both 
databases  contain  a  lot  of  discussing,  informing  and 
story-telling  activities  however  the  meeting  data 
contains  a  lot  more  planning  and  advising. 


and  activity  and  annotate  them  with  an  activity  which  fol¬ 
lows  the  intuitive  definition  of  the  action-verb  such  as  dis¬ 
cussing,  planning,  etc.  Additionally  an  activity  annotation 
manual  containing  more  specific  instructions  has  been  avail¬ 
able  (Ries  et  ah,  2000;  Thyme-Gobbel  et  ah,  2001)  The 
list  of  tags  and  the  distribution  can  be  seen  in  Tab.  1.  The 
set  of  activities  can  be  clustered  into  “interactive”  activities 
of  equal  contribution  rights  (discussion,planning),  one  per¬ 
son  being  active  (advising,  information  giving,  story-telling), 
interrogations  and  all  others. 


Measure 

Meeting 

SBC 

CallHome 

all  inter 

all 

inter 

Spanish 

K 

0.41  0.51 

0.49 

0.56 

0.59 

Mutual  inf. 

0.35  0.25 

0.65 

0.32 

0.61 

Table  2:  Intercoder  agreement  for  activities:  The 
meeting  dialogues  and  Santa  Barbara  corpus  have 
been  annotated  by  a  semi-naive  coder  and  the  first 
author  of  the  paper.  The  K-coefficient  is  determined 
as  in  Carletta  et  al.  (1997)  and  mutual  information 
measures  how  much  one  label  “informs”  the  other 
(see  Sec.  3).  For  CallHome  Spanish  3  dialogues  were 
coded  for  activities  by  two  coders  and  the  result 
seems  to  indicate  that  the  task  was  easier. 

Both  datasets  have  been  annotated  not  only  by  a  semi- 
naive  annotator  but  also  by  the  first  author  of  the  paper. 
The  results  for  K-statistics  (Carletta  et  ah,  1997)  and  mu¬ 
tual  information  between  the  coders  can  be  seen  in  Tab.  2. 
The  intercoder  agreement  would  be  considered  moderate  but 
compares  approximately  to  Carletta  et  al.  (1997)  agreement 
on  transactions  (k  =  0.59),  especially  for  the  interactive  ac¬ 
tivities  and  CallHome  Spanish. 

For  classification  a  neural  network  was  trained  that  uses 
the  softmax  function  as  its  output  and  KL-divergence  as 

®  In  contrast  to  (Ries  et  ah,  2000;  Thyme-Gobbel  et  ah, 
2001)  the  “consoling”  activity  has  been  eliminated  and  an 
“informing”  activity  has  been  introduced  for  segments  where 
one  or  more  than  one  member  of  the  rejoinder  give  informa¬ 
tion  to  the  others.  Additionally  an  “introducing”  activity 
was  added  to  account  for  a  introduction  of  people  or  topics 
at  the  beginning  of  meetings. 


Feature 

SBC 

all 

meet 

interactive 
SBC  meet 

baseline 

32.7 

41.1 

50.5 

54.6 

dialogue  acts  per  channel 

28.1 

37.6 

47.7 

56.7 

dialogue  acts 

28.0 

36.2 

46.7 

65.3 

words 

38.3 

39.7 

53.3 

54.6 

dominance 

32.7 

44.7 

64.5 

58.2 

style 

24.3 

35.5 

53.3 

58.9 

style  -I-  words 

42.1 

38.3 

52.3 

57.5 

dominance  -I-  words 

41.1 

41.1 

52.3 

58.9 

dominance  -I-  style  -I-  words 

42.1 

39.7 

53.3 

60.3 

dialogue  acts  -I-  words 

42.1 

37.6 

57.0 

61.0 

dialogue  acts  -|-  style  -|-  words 

39.3 

40.4 

57.9 

61.0 

Wordnet 

37.4 

37.6 

46.7 

52.5 

Wordnet  -I-  words 

49.5 

39.0 

53.3 

57.5 

first  author 

59.8 

57.9 

73.8 

72.7 

Table  3:  Activity  detection:  Activities  are  detected 
on  the  Santa  Barbara  Corpus  (SBC)  and  the  meet¬ 
ing  database  (meet)  either  without  clustering  the 
activities  (all)  or  clustering  them  according  to  their 
interactivity  (interactive)  (see  Sec.  2  for  details). 


the  error  function.  The  network  connects  the  input  di¬ 
rectly  to  the  output  units.  Hidden  units  have  not  been  used 
since  they  did  not  yield  improvements  on  this  task.  The 
network  was  trained  using  RPROP  with  momentum  (Ried- 
miller  and  Braun,  1993)  and  corresponds  to  an  exponen¬ 
tial  model  (Nigam  et  al.,  1999).  The  momentum  term  can 
be  interpreted  as  a  Gaussian  prior  with  zero  mean  on  the 
network  weights.  It  is  the  same  architecture  that  we  used 
previously  (Ries  et  al.,  2000)  for  the  detection  of  activities 
on  CallHome  Spanish.  Although  some  feature  sets  could  be 
trained  using  the  iterative  scaling  algorithm  if  no  hidden 
units  are  being  used  the  training  times  weren’t  high  enough 
to  justify  the  use  of  the  less  flexible  iterative  scaling  algo¬ 
rithm.  The  features  used  for  classification  are 

words  the  50  most  frequent  words  /  part  of  speech  pairs 
are  used  directly,  all  other  pairs  are  replaced  by  their 
part  of  speech 

stylistic  features  adapted  from  Biber  (1988)  and  contain 
mostly  syntactic  constructions  and  some  word  classes. 

Wordnet  a  total  of  40  verb  and  noun  classes  (so  called  lex¬ 
icographers  classes  (Fellbaum,  1998))  are  defined  and 
a  word  is  replaced  by  the  most  frequent  class  over  all 
possible  meanings  of  the  word. 

dialogue  acts  such  as  statements,  questions,  backchannels, 
. . .  are  detected  using  a  language  model  based  detec¬ 
tor  trained  on  Switchboard  similar  to  Stolcke  et  al. 
(2000)  ® 

^Klaus  Zechner  trained  an  English  part  of  speech  tagger 
tagger  on  Switchboard  that  has  been  used.  The  tagger  uses 
the  code  by  Brill  (1994). 

®The  model  was  trained  to  be  very  portable  and  therefore 
the  following  choices  were  taken:  (a)  the  dialogue  model 
is  context-independent  and  (b)  only  the  part  of  speech  are 
taken  as  the  input  to  the  model  plus  the  50  most  likely 
word/part  of  speech  types. 


dominance  is  described  as  the  distribution  of  the  speaker 
dominance  in  a  conversation.  The  distribution  is  rep¬ 
resented  as  a  histogram  and  speaker  dominance  is  mea¬ 
sured  as  the  average  dominance  of  the  dialogue  acts  (Linell 
et  al.,  1988)  of  each  speaker.  The  dialogue  acts  are  de¬ 
tected  and  the  dominance  is  a  numeric  value  assigned 
for  each  dialogue  act  type.  Dialogue  act  types  that 
restrict  the  options  of  the  conversation  partners  have 
high  dominance  (questions),  dialogue  acts  that  signal 
understanding  (backchannels)  carry  low  dominance. 

First  author  The  activities  used  for  classification  are  those 
of  the  semi-naive  coder.  The  “first  author”  column  de¬ 
scribes  the  “accuracy”  of  the  first  author  with  respect 
to  the  naive  coder. 

The  detection  of  interactive  activities  works  fairly  well 
using  the  dominance  feature  on  SBC  which  is  also  natu¬ 
ral  since  the  relative  dominance  of  speakers  should  describe 
what  kind  of  interaction  is  exhibited.  The  dialogue  act  dis¬ 
tribution  on  the  other  hand  works  fairly  well  on  the  more 
homogeneous  meeting  database  were  there  is  a  better  chance 
to  see  generalizations  from  more  specific  dialogue  based  in¬ 
formation.  Overall  the  combination  of  more  than  one  feature 
is  really  important  since  word  level,  Wordnet  and  stylistic 
information,  while  sometimes  successful,  seem  to  be  able  to 
improve  the  result  while  they  don’t  provide  good  features  by 
themselves.  The  meeting  data  is  also  more  difficult  which 
might  be  due  to  its  informal  style. 


Another  option  is  to  assume  that  the  labels  of  one  coder 
are  part  of  D.  If  the  query  by  the  other  coder  is  R  we  are  in¬ 
terested  in  the  reduction  of  the  document  entropy  given  the 
query.  If  we  furthermore  assume  that  H{R\D)  =  H{R\R') 
where  R!  is  the  activity  label  embedded  in  D\ 

H{D)  -  H{D\R)  =  H{R)  -  H{R\D)  =  MI{R,  R') 

Tab.  2  shows  that  the  labels  of  the  semi-naive  coder  and  the 
first  author  only  inform  each  other  by  0.25  —  0.65  bits.  How¬ 
ever,  since  all  constraints  are  important  to  apply,  it  might  be 
important  to  include  manual  annotations  to  be  matched  by 
a  query  or  in  a  graphical  presentation  of  the  output  results. 

Another  interesting  question  to  consider  is  whether  the 
activity  is  correlated  with  the  rejoinder  or  not.  This  ques¬ 
tion  is  important  since  a  correlation  of  the  activity  with  the 
rejoinder  would  mean  that  the  indexing  performance  of  ac¬ 
tivities  needs  to  be  compared  to  other  indices  that  apply  to 
rejoinders  such  as  attendance,  time  and  place  (for  results  on 
the  correlation  with  rejoinders  see  Waibel  et  al.  (2001)).  The 
correlation  can  be  measured  using  the  mutual  information 
between  the  activity  and  the  meeting  identity.  The  mutual 
information  is  moderate  for  SBC  (~  0.67  bit)  and  much 
lower  for  the  meetings  («  0.20  bit).  This  also  corresponds 
to  our  intuition  since  some  of  the  rejoinders  in  SBC  belong 
to  very  distinct  dialogue  genre  while  the  meeting  database 
is  homogeneous.  The  conclusion  is  that  activities  are  useful 
for  navigation  in  a  rejoinder  if  the  database  is  homogeneous 
and  they  might  be  useful  for  finding  conversations  in  a  more 
heterogeneous  database. 


3.  INFORMATION  ACCESS  ASSESSMENT 

Assuming  a  probabilistic  information  retrieval  model  a 
query  r  -  in  our  example  an  activity  -  predicts  a  docu¬ 
ment  d  with  the  probability  q{d\r)  =  Let  p{d,r) 

be  the  real  probability  mass  distribution  of  these  quanti¬ 
ties.  The  probability  mass  function  q{r\d)  is  estimated  on 
a  separate  training  set  by  a  neural  network  based  classi¬ 
fier  The  quantity  we  are  interested  in  is  the  reduction 
in  expected  coding  length  of  the  document  using  the  neural 
network  based  detector  ' : 
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The  two  expectations  correspond  exactly  to  the  measures  in 
Tab.  5,  the  first  represents  the  baseline,  the  second  the  one 
for  the  respective  classifier.  In  more  standard  information 
theoretic  notation  this  quantity  may  be  written  as: 


H{R)  -  {Hp{R\D)  +  D{pir\d)\\q{r\d))) 


This  equivalence  is  not  extremely  useful  though  since  the 
quantities  in  parenthesis  can’t  be  estimated  separately.  For 
the  small  meeting  database  and  SBC  however  no  entropy 
reductions  could  be  obtained.  On  the  larger  databases,  on 
the  other  hand,  entropy  reductions  could  be  obtained  (« 
0.5bit  on  the  CallHome  Spanish  database  Ries  et  al.  (2000), 
«  Ibit  for  the  sub-database  detection  problem  in  Sec.  4). 

®A11  quantities  involving  the  neural  net  q{r\d)  have  been 
determined  using  a  round  robin  approach  such  that  network 
is  trained  on  a  separate  training  set. 

^Since  estimating  q{d)  is  simple  we  may  assume  that  q{d)  « 

ErP(rfT)- 


# 

# 

# 

Talk 

344 

Edu 

25 

Finance 

8 

News 

217 

Scifi 

24 

Religious 

5 

Sitcom 

97 

Series 

24 

Series-Old 

3 

Soap 

87 

Cartoon 

23 

Infotain 

3 

Game 

46 

Movies 

22 

Music 

2 

Law 

32 

Crafts 

17 

Horror 

1 

Sports 

32 

Specials 

15 

Drama 

31 

Comedy 

9 

Table  4:  TV  show  types:  The  distribution  of  show 
types  in  a  large  database  of  TV  shows  (1067  shows) 
that  has  been  recorded  over  the  period  of  a  couple 
of  months  until  April  2000  in  Pittsburgh,  PA 


4.  DETECTION  OF  SUB-DATABASES 

We  set  up  an  environment  for  TV  shows  that  records  the 
subtitles  with  timestamps  continuously  from  one  TV  chan¬ 
nel  and  the  channel  was  switched  every  other  day.  At  the 
same  time  the  TV  program  was  downloaded  from  http: 
//tv .  yahoo .  com/  to  obtain  programming  information  in¬ 
cluding  the  genre  of  the  show.  Yahoo  assigns  primary  and 
secondary  show  types  and  unless  the  combination  of  pri¬ 
mary/secondary  show-type  is  frequent  enough  the  primary 
showtype  is  used  (Tab.  4).  The  TV  show  database  has  the 
advantage  that  we  were  able  to  collect  a  large  and  varied 
database  with  little  effort.  The  same  classifier  as  in  Sec.  2 
has  been  used  however  dialogue  acts  have  not  been  detected 
since  the  data  contains  a  lot  of  noise,  is  not  necessarily  con¬ 
versational  and  speaker  identities  can’t  be  determined  easily. 
Detection  results  for  TV  shows  can  be  seen  in  Tab.  5.  It  may 


be  noted  that  adding  a  lot  of  keywords  does  improve  the  de¬ 
tection  result  but  not  so  much  the  entropy.  It  may  therefore 
be  assume  that  there  is  a  limited  dependence  between  topic 
and  genre  which  isn’t  really  a  surprise  since  there  are  many 
shows  with  weekly  sequels  and  there  may  be  some  true  re¬ 
peats. 
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stylistic 

words 

accuracy 

entropy 

baseline 

32.2 

3.31 

• 

50.9 

2.73 

• 

50 

62.2 

2.33 

• 

• 

50 

60.0 

2.29 

• 

• 

61.2 

2.28 

• 

56.9 

2.41 

• 

50 

61.5 

2.25 

50 

61.3 

2.35 

250 

62.7 

2.17 

500 

66.0 

2.14 

• 

• 

500 

64.9 

2.13 

5000 

67.2 

2.08 

Table  5:  Show  type  detection:  Using  the  neural  net¬ 
work  described  in  Sec.  2  the  show  type  was  detected. 
If  there  is  a  number  in  the  word  column  the  word 
feature  is  being  used.  The  number  indicates  how 
many  word/part  of  speech  pairs  are  in  the  vocabu¬ 
lary  additionally  to  the  parts  of  speech. 


5.  EMOTION  AND  DOMINANCE 

Emotions  are  displayed  in  a  variety  of  gestures,  some  of 
which  are  oral  and  may  be  detected  via  automated  methods 
from  the  audio  channel  (Polzin,  1999).  Using  only  verbal 
information  the  emotions  happy,  excited  and  neutral  can 
be  detected  on  the  meeting  database  with  88.1%  accuracy 
while  always  picking  neutral  yields  83.6%.  This  result  can  be 
improved  to  88.6%  by  adding  pitch  and  power  information. 

While  these  experiments  were  conducted  at  the  utterance 
level  emotions  can  be  extended  to  topical  segments.  For 
that  purpose  the  emotions  of  the  individual  utterances  are 
entered  in  a  histogram  over  the  segment  and  the  vectors 
are  clustered  automatically.  The  resulting  clusters  roughly 
correspond  to  a  “neutral” ,  “a  little  happy”  and  “somewhat 
excited”  segment.  Using  the  classifier  for  emotions  on  the 
word  level  the  segment  can  be  classified  automatically  into 
categories  with  a  83.3%  accuracy  while  the  baseline  is  68.9%. 
The  entropy  reduction  by  automatically  detected  emotional 
activities  is  «  O.Sbit  A  similar  attempt  can  be  made  for 
dominance  (Linell  et  ah,  1988)  distributions:  Dominance  is 
easy  to  understand  for  the  user  of  an  information  access 
system  and  it  can  be  determined  automatically  with  high 
accuracy. 


“  A  similar  classihcation  result  for  emotions  on  the  utter¬ 
ance  level  has  been  obtained  by  just  using  the  laughter  vs. 
non-laughter  tokens  of  the  transcript  as  the  input.  This 
may  indicate  that  (a)  the  index  should  really  be  the  amount 
of  laughter  in  the  conversational  segment  and  that  (b)  emo¬ 
tions  might  not  be  displayed  very  overtly  in  meetings.  These 
results  however  would  require  a  wider  sampling  of  meeting 
types  to  be  generally  acceptable. 


6.  CONCLUSION  AND  FUTURE  WORK 

It  has  been  shown  that  activities  can  be  detected  and  that 
they  may  be  efficient  indices  for  access  to  oral  communica¬ 
tion.  Overall  it  is  easy  to  make  high  level  distinctions  with 
automated  methods  while  fine-grained  distinctions  are  even 
hard  to  make  for  humans  -  on  the  other  hand  automatic 
methods  are  still  able  to  model  some  aspect  of  it  (Fig.  3). 
To  obtain  an  reduction  in  entropy  a  relatively  large  database 
such  as  CallHome  Spanish  is  required  (120  dialogues).  Al¬ 
ternatives  to  activities  might  be  emotional  and  dominance 
distributions  that  are  easier  to  detect  and  that  may  be  nat¬ 
ural  to  understand  for  users.  If  activities  are  only  used  for 
local  navigation  support  within  a  rejoinder  one  could  also 
visualize  by  displaying  the  dialogue  act  patterns  for  each 
channel  on  a  time  line. 

The  author  has  also  observed  that  topic  clusters  and  activ¬ 
ities  are  largely  independent  in  the  meeting  domain  result¬ 
ing  in  orthogonal  indices.  Since  activities  have  intuitions  for 
naive  users  and  they  may  be  remembered  it  can  be  assumed 
that  users  would  be  able  to  make  use  of  these  constraints. 
Ongoing  work  includes  the  use  of  speaker  activity  for  dia¬ 
logue  segmentation  and  further  assessment  of  features  for 
information  access.  Overall  the  methods  presented  here  and 
the  ongoing  work  are  improving  the  ability  to  index  oral 
communication.  It  should  be  noted  that  some  of  the  tech¬ 
niques  presented  lend  themselves  to  implementations  that 
don’t  require  (full)  speech  recognition:  Speaker  identifica¬ 
tion  and  dialogue  act  identihcation  may  be  done  without 
an  LVCSR  system  which  would  allow  to  lower  the  compu¬ 
tational  requirements  as  well  as  to  a  more  robust  system. 
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Figure  3:  Detection  accuracy  summary:  The  detec¬ 
tion  of  high-level  genre  as  exemplified  by  the  differ¬ 
entiation  of  corpora  can  be  done  with  high  accuracy 
using  simple  features  (Ries,  1999).  Similar  it  was 
fairly  easy  to  discriminate  between  male  and  female 
speakers  on  Switchboard  (Ries,  1999).  Discrimi¬ 
nating  between  sub-genre  such  as  TV-show  types 
(Sec.  4)  can  be  done  with  reasonable  accuracy.  How¬ 
ever  it  is  a  lot  harder  to  discriminate  between  ac¬ 
tivities  within  one  conversation  for  personal  phone 
calls  (CallHome)  (Ries  et  al.,  2000)  or  for  general 
rejoinders  (Santa)  and  meetings  (Sec.  2). 
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